{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcdd Exercise M4.03\n",
    "\n",
    "Now, we tackle a (relatively) realistic classification problem instead of making\n",
    "a synthetic dataset. We start by loading the Adult Census dataset with the\n",
    "following snippet. For the moment we retain only the **numerical features**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
    "target = adult_census[\"class\"]\n",
    "data = adult_census.select_dtypes([\"integer\", \"floating\"])\n",
    "data = data.drop(columns=[\"education-num\"])\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We confirm that all the selected features are numerical.\n",
    "\n",
    "Define a linear model composed of a `StandardScaler` followed by a\n",
    "`LogisticRegression` with default parameters.\n",
    "\n",
    "Then use a 10-fold cross-validation to estimate its generalization performance\n",
    "in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
    "the trained estimators."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the most important feature seen by the logistic regression?\n",
    "\n",
    "You can use a boxplot to compare the absolute values of the coefficients while\n",
    "also visualizing the variability induced by the cross-validation resampling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now work with **both numerical and categorical features**. You can\n",
    "reload the Adult Census dataset with the following snippet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
    "target = adult_census[\"class\"]\n",
    "data = adult_census.drop(columns=[\"class\", \"education-num\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a predictive model where:\n",
    "- The numerical data must be scaled.\n",
    "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
    "  group categories concerning less than 1% of the total samples.\n",
    "- The predictor is a `LogisticRegression` with default parameters, except that\n",
    "  you may need to increase the number of `max_iter`, which is 100 by default.\n",
    "\n",
    "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
    "above to evaluate the full pipeline, including the feature scaling and encoding\n",
    "preprocessing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By comparing the cross-validation test scores of both models fold-to-fold,\n",
    "count the number of times the model using both numerical and categorical\n",
    "features has a better test score than the model using only numerical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the following questions, you can copy and paste the following snippet to\n",
    "get the feature names from the column transformer here named `preprocessor`.\n",
    "\n",
    "```python\n",
    "preprocessor.fit(data)\n",
    "feature_names = (\n",
    "    preprocessor.named_transformers_[\"onehotencoder\"].get_feature_names_out(\n",
    "        categorical_columns\n",
    "    )\n",
    ").tolist()\n",
    "feature_names += numerical_columns\n",
    "feature_names\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that there are as many feature names as coefficients in the last step\n",
    "of your predictive pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which of the following pairs of features is most impacting the predictions of\n",
    "the logistic regression classifier based on the absolute magnitude of its\n",
    "coefficients?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now create a similar pipeline consisting of the same preprocessor as above,\n",
    "followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
    "enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
    "engineering step. Remember not to include a \"bias\" feature to avoid\n",
    "introducing a redundancy with the intercept of the subsequent logistic\n",
    "regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the same 10-fold cross-validation strategy as above to evaluate this\n",
    "pipeline with interactions. In this case there is no need to return the\n",
    "estimator, as the number of features generated by the `PolynomialFeatures` step\n",
    "is much too large to be able to visually explore the learned coefficients of the\n",
    "final classifier.\n",
    "\n",
    "By comparing the cross-validation test scores of both models fold-to-fold,\n",
    "count the number of times the model using multiplicative interactions and both\n",
    "numerical and categorical features has a better test score than the model\n",
    "without interactions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}